Data Science for Biologists
Associate Professor in Data Science and Genetics at the University of East Anglia.
Academic background in Behavioural Ecology, Genetics, and Insect Pest Control.
Teach Genetics, Programming, and Statistics
One workshop per week
One lecture per week
One assignment per week
One ‘capstone’ project
I hope you end up with more questions than answers!
For research to be reproducible, both data and methods should be available.
Applying the described methods to the data leads to the same results
In theory, making methods available ≠ sharing code
But with complex data and analyses - are methods of data collection enough?
Science advances incrementally by identifying and rectifying errors over time
Peer review: Critical evaluation of papers by experts maintains quality
Independent studies either support or fail to replicate findings
Publication bias: preference for positive results
Pressure to publish
Poor study designs and statistical issues
Lack of transparency
The reproducibility crisis emerged when numerous studies, especially in fields like psychology, medicine, and biology, failed to be replicated by other researchers.
High-profile replication attempts revealed that many published results could not be consistently reproduced, raising doubts about their validity.
Recognition that no study should be considered ‘definitive’
Empower lasting systemic change through increased transparency in research methods, data sharing and reporting
Structural change in academic culture
Open science is a global movement that aims to make scientific research and its outcomes freely accessible to everyone. By fostering practices like data sharing and preregistration, open science not only accelerates scientific progress but also strengthens trust in research findings.
UK Reproducibility Network - funded by UK Research Council
46 member institutions (UEA is one)
Establish open research practices across UK Research
/home/phil/Documents/paper
├── abstract.R
├── correlation.png
├── data.csv
├── data2.csv
├── fig1.png
├── figure 2 (copy).png
├── figure.png
├── figure1.png
├── figure10.png
├── partial data.csv
├── script.R
└── script_final.R
README
Documented
Easy to code with
All files are inside the root folder
What do you think are the contents of these files:
data/raw/madrid_minimum-temperature.csv
scripts/02_compute_mean-temperature.R
analysis/01_madrid_minimum-temperature_descriptive-statistics.qmd
Come up with good names for these:
a dataset of cats with columns for weight, length, tail length, fur colour(s), fur type and name.
a script that downloads data from Spotify.
a script that cleans up data.
a script that fits a linear discriminant model and saves it to a file.
Use projects
Check your code runs on blank slates
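One concrete habit that supports both points: build file paths relative to the project root rather than hard-coding absolute paths, so the code still runs on a blank slate or on someone else's machine. A minimal sketch (the file name below is made up for illustration):

```r
# Portable: relative to the project root, works wherever the project is opened
file.path("data", "raw", "temperatures.csv")

# Fragile: breaks as soon as the project moves or someone else runs it
# "/home/phil/Documents/paper/data.csv"
```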
Automates the creation of a paper or report
Saves time
Reduces errors
(https://www.nature.com/articles/d41586-022-00563-z)
What is a Statistical Model?
A model is a simplified representation of real-world processes.
It helps us describe, explain, and predict outcomes.
A good fit makes accurate predictions; a poor fit can lead to misleading conclusions.
To make reliable inferences, the model must accurately represent the data.
“Later, we’ll see how models help us test hypotheses using p-values to assess if the data fits our expectations.”
Population:
The entire group you want to study (e.g., all humans, all mice in a lab).
Studying the entire population is often impractical due to time, cost, or logistics.
Sample:
A subset of the population, selected to make inferences about the whole.
Must be representative to ensure accurate conclusions.
Summarize data to highlight key features:
Central Tendency: Where is the “center” of the data? (mean, median, mode)
Spread: How variable are the data? (variance, standard deviation)
Helps us understand the data before making inferences.
The central tendency of a series of observations is a measure of the “middle value”
The three most commonly reported measures of central tendency are the sample mean, median, and mode.
| Mean | Median | Mode |
|---|---|---|
| The average value | The middle value | The most frequent value |
| Sum of all values divided by n | The middle value (if n is odd), or the average of the two central values (if n is even) | The most frequent value |
| Most common reported measure, affected by outliers | Less influenced by outliers, improves as n increases | Less common |
One of the simplest statistical models in biology is the mean
| Lecturer | Friends |
|---|---|
| Mark | 5 |
| Tony | 3 |
| Becky | 3 |
| Ellen | 2 |
| Phil | 1 |
| Mean | 2.8 |
| Median | 3 |
| Mode | 3 |
Calculating the mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \] \[\frac{5 + 3 + 3 + 2 + 1}{5} = 2.8\]
Sum all the values and divide by the number of values, n
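As a quick check in R, using the friends data from the table above:

```r
# Friends data from the table
friends <- c(Mark = 5, Tony = 3, Becky = 3, Ellen = 2, Phil = 1)

sum(friends) / length(friends)  # mean by hand: 2.8
mean(friends)                   # built-in equivalent: 2.8
median(friends)                 # middle value: 3
```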
One of the simplest statistical models in biology is the mean
| Lecturer | Friends |
|---|---|
| Mark | 5 |
| Tony | 3 |
| Becky | 3 |
| Ellen | 2 |
| Phil | 1 |
| Mean | 2.8 |
| Median | 3 |
| Mode | 3 |
We already know this is a hypothetical value as you can’t have 2.8 friends (I think?)
Now, with any model, we have to know how well it fits / how accurate it is
If data is symmetrically distributed, the mean and median will be close, especially as n increases.
The most familiar example of such a symmetric, bell-shaped distribution is the normal distribution
Variance:
The average of the squared differences from the mean.
\[ s^2 = \frac{\sum(x - \bar{x})^2}{n - 1} \] Higher variance = more spread.
Standard Deviation (SD):
The square root of variance.
Easier to interpret because it’s in the same units as the data.
| Lecturer | Friends | Residuals | Sq Resid |
|---|---|---|---|
| Mark | 5 | 2.2 | 4.84 |
| Tony | 3 | 0.2 | 0.04 |
| Becky | 3 | 0.2 | 0.04 |
| Ellen | 2 | -0.8 | 0.64 |
| Phil | 1 | -1.8 | 3.24 |
| Mean | 2.8 | | |
Sum of Squared Residuals = 8.8
n - 1 = 4
Variance = 8.8/4 = 2.2
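The same calculation in R, checked against the built-in `var()`:

```r
friends <- c(5, 3, 3, 2, 1)

residuals <- friends - mean(friends)      # distance of each value from the mean
sum(residuals^2)                          # sum of squared residuals: 8.8
sum(residuals^2) / (length(friends) - 1)  # sample variance (divide by n - 1): 2.2
var(friends)                              # built-in gives the same answer
```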
Square root of sample variance
A measure of dispersion of the sample
Smaller SD = more values closer to mean, larger SD = greater data spread from mean
Standard deviation:
\[ s = \sqrt{\frac{\sum(x - \bar{x})^2}{n - 1}} \]
## n - 1? {.smaller}
For a population, the variance \(\sigma^2\) is exactly the mean squared distance of the values from the population mean \(\mu\):
\[ \sigma^2 = \frac{\sum(x - \mu)^2}{N} \]
But computing this from a sample (substituting the sample mean \(\bar{x}\) and dividing by n) gives a biased estimate of the population variance
A biased sample variance will, on average, underestimate the population variance
Dividing by n - 1 instead of n (Bessel’s correction) removes this bias
| Lecturer | Friends | Diff | Squared diff |
|---|---|---|---|
| Mark | 5 | 2.2 | 4.84 |
| Tony | 3 | 0.2 | 0.04 |
| Becky | 3 | 0.2 | 0.04 |
| Ellen | 2 | -0.8 | 0.64 |
| Phil | 1 | -1.8 | 3.24 |
| Mean | 2.8 | | |
| Variance | 2.2 | | |
| SD | 1.48 | | |
Small \(s\) = data points are clustered near the mean
Large \(s\) = data points are widely dispersed around the mean
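In R, `sd()` is simply the square root of `var()`:

```r
friends <- c(5, 3, 3, 2, 1)

sqrt(var(friends))  # standard deviation by hand: ~1.48
sd(friends)         # built-in equivalent
```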
Shape: Symmetrical, bell-shaped curve
Described by just two parameters: mean (μ) and standard deviation (σ).
Rule of Thumb: For normally distributed data:
~68% within 1 SD
~95% within 2 SDs
~99.7% within 3 SDs
Relevance: Many statistical tests assume data follows a normal distribution. This helps us calculate probabilities—like p-values—to test hypotheses.
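The rule of thumb can be verified with R’s cumulative normal distribution function, `pnorm()`:

```r
# Proportion of a standard normal distribution within k SDs of the mean
pnorm(1) - pnorm(-1)  # ~0.683
pnorm(2) - pnorm(-2)  # ~0.954
pnorm(3) - pnorm(-3)  # ~0.997
```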
If we assume a normal distribution (or close enough), we can calculate the probability of observing any given value using just the mean and standard deviation.
\[ f(x) = \frac{1}{\sqrt{2\pi} \, \sigma} \exp\!\Biggl(-\frac{(x - \mu)^2}{2\,\sigma^2}\Biggr) \]
This has applications in hypothesis testing and building confidence intervals
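This density function is exactly what R’s `dnorm()` computes; writing it out by hand gives the same value (the x, μ, and σ below are arbitrary example values):

```r
x <- 1.3; mu <- 0; sigma <- 1  # arbitrary example values

by_hand <- 1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
by_hand
dnorm(x, mean = mu, sd = sigma)  # identical
```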
Histograms plot frequency/density of observations within bins
Quantile-Quantile plots plot quantiles of a dataset vs. quantiles of a theoretical (usually normal) distribution
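A minimal sketch of both checks in base R, using simulated data (so the sample should look approximately normal):

```r
set.seed(42)
x <- rnorm(200, mean = 10, sd = 2)  # simulated normal data for illustration

hist(x, breaks = 20)  # should be roughly bell-shaped

qqnorm(x)  # points should fall close to the reference line
qqline(x)
```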
Problem: Different datasets have different means and standard deviations.
Solution: Standardization allows comparisons by converting any normal distribution into a standard normal distribution (mean = 0, SD = 1).
\[ Z = \frac{X - \mu}{\sigma} \]
Z: How many standard deviations a value is from the mean.
X: Observed value.
μ: Population mean.
σ: Population standard deviation.
The standard normal distribution allows us to calculate probabilities of observing extreme values.
Example: If \(Z = 2\), the observation is 2 standard deviations above the mean.
Using a Z-table, we can find the probability of getting a result at least this extreme.
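In R, `pnorm()` takes the place of a Z-table. Using the \(Z = 2\) example (the raw numbers below are made up purely to give Z = 2):

```r
x <- 130; mu <- 100; sigma <- 15  # made-up values giving Z = 2

z <- (x - mu) / sigma
z  # 2

pnorm(z, lower.tail = FALSE)  # P(at least this far above the mean): ~0.023
2 * pnorm(-abs(z))            # two-tailed: at least this extreme either way: ~0.046
```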
This concept extends to p-values, which tell us how rare our data is under the null hypothesis.
A p-value is the probability of obtaining results at least as extreme as the ones we observed, assuming the null hypothesis is true.
It helps quantify how surprising or unusual our data is under the null hypothesis.
A low p-value suggests that the observed data would be rare if the null hypothesis were true, which may lead us to question the null hypothesis
Imagine flipping a coin 100 times, expecting about 50 heads if it’s fair (the null hypothesis). If you observe 90 heads, the p-value tells you how likely it is to get such an extreme result just by chance.
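An exact version of this calculation in R, via a binomial test:

```r
# Probability of a result at least as extreme as 90 heads in 100 flips,
# assuming the coin is fair (two-tailed)
binom.test(x = 90, n = 100, p = 0.5)$p.value  # far below 0.05
```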
Example: A new drug has no effect compared to a placebo.
Example: The new drug does improve patient outcomes compared to the placebo.
The shaded areas in the tails represent extreme outcomes (typically the most unexpected 5% if using an \(\alpha\) = 0.05).
If your observed data falls within these tails, it’s considered statistically significant, suggesting it’s unlikely to occur by random chance under the null hypothesis.
Key Point: Larger sample sizes reduce variability, making it easier to detect small differences as statistically significant.
However, statistical significance doesn’t always mean practical importance—especially with large samples.
Scenario: Testing if diet A and diet B affect mice longevity.
Null Hypothesis (H₀): No difference in longevity between diets.
Alternative Hypothesis (H₁): There is a difference in longevity.
lm(longevity ~ diet, data = mice)
Explanation: With a two-level categorical predictor, this linear model is equivalent to a two-sample t-test of whether mean longevity differs between the diets.
The t-distribution shows where your observed mean difference falls relative to what’s expected under the null hypothesis.
Confidence Intervals (CI): The red area marks the 95% CI. If zero falls outside this range, it suggests statistical significance.
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.19 0.276 15.2 3.87e-13 3.62 4.77
2 dietmice_b -1.41 0.391 -3.60 1.60e- 3 -2.22 -0.595
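Output like the table above can be reproduced with base R alone. The data below are simulated stand-ins (group means and sample sizes are assumptions, so the exact numbers will differ):

```r
set.seed(1)
# Simulated stand-in for the mice data (group means/sizes are assumptions)
mice <- data.frame(
  diet = rep(c("mice_a", "mice_b"), each = 12),
  longevity = c(rnorm(12, mean = 4.2, sd = 0.7),   # diet A
                rnorm(12, mean = 2.8, sd = 0.7))   # diet B
)

fit <- lm(longevity ~ diet, data = mice)
summary(fit)$coefficients  # estimates, SEs, t statistics, p-values
confint(fit)               # 95% confidence intervals
```

With 24 mice and 2 estimated parameters, the residual degrees of freedom are 22, matching the \(t_{22}\) reported in the write-ups below.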
“Diets can change the longevity of mice (p = 0.0016).”
“Mice on diet B lived significantly shorter lives than mice on diet A (\(t_{22}\) = -3.6, p = 0.0016).”
“Mice on diet B had a mean lifespan 1.41 years shorter (95% CI: 0.595 to 2.22) than mice on diet A (mean 4.19 years; 95% CI: 3.62 to 4.77). While statistically significant (\(t_{22}\) = -3.6, p = 0.0016), this is a relatively small sample size, and further testing is recommended to confirm this effect.”
Which has the greatest level of useful detail?
A p-value is NOT:
The probability that the null hypothesis is true.
The probability your results occurred “by chance.”
Proof of a meaningful or large effect.
What it IS: the probability of obtaining results at least as extreme as ours, assuming the null hypothesis is true.
In reality, we can’t know for sure if a true mean difference exists.
For illustration: Assume we could know the true mean difference.
The figure shows:
Grey line: Expected data if the null hypothesis is true.
Black line: Expected data if the alternative hypothesis is true.
A p-value shows how surprising the data are if the null is true.
A low p-value is evidence against the null, not proof of the alternative.
What we can conclude, based on our data, is that we have observed an extreme outcome that should be considered surprising. But such an outcome is not impossible when the null hypothesis is true.
If we plot the null model for a very large sample size, we can see that even very small mean differences will be considered ‘surprising’.
However, just because data is surprising, does not mean we need to care about it. It is mainly the verbal label ‘significant’ that causes confusion here – it is perhaps less confusing to think of a ‘significant’ effect as a ‘surprising’ effect.
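A quick simulation of this point: with a huge sample, even a trivially small true difference yields a tiny p-value (the effect size and sample size below are arbitrary):

```r
set.seed(2)
# True difference of 0.05 SDs -- negligible in practice
a <- rnorm(100000, mean = 0)
b <- rnorm(100000, mean = 0.05)

t.test(a, b)$p.value  # "significant", despite the effect being tiny
```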
Discovering Statistics - Andy Field
An Introduction to Generalized Linear Models - Dobson & Barnett
An Introduction to Statistical Learning with Applications in R - James, Witten, Hastie & Tibshirani
Mixed Effects Models and Extensions in Ecology with R - Zuur, et al.
Ecological Statistics with contemporary theory and application
The Big Book of R (https://www.bigbookofr.com/)
Writing statistical methods for ecologists
Reporting statistical methods and outcome of statistical analyses in research articles
Design principles for data analysis
Log-transformation and its implications for data analysis.
Effect size, confidence interval and statistical significance: a practical guide for biologists
Misconceptions, Misuses, and Misinterpretations of P Values and Significance Testing
Ten common statistical mistakes to watch out for when writing or reviewing a manuscript.
Why most published research findings are false
Model averaging and muddled multimodel inference
A brief introduction to mixed effects modelling and multi-model inference in ecology
The Practical Alternative to the p Value Is the Correctly Used p Value